Personal selection of AI research articles

A collection of well-written articles that I enjoyed reading and that helped me grow as an AI professional


Reading research articles is an important habit for a professional in the AI field. I confess that I have not read all the important articles in the domain, and this page is updated regularly as I add information about articles I have previously read. For now, instead of providing an extensive list of all the seminal AI papers (which I will strive to read and add later), I mention the papers that I have already read and that I feel were significant stepping stones in my AI journey. I try to strike a balance between all fields of AI, although there may be more papers in the computer vision section, as that speciality has been an important part of my academic and professional path so far. Even though machine learning is my main interest, I also include papers outside that area that help in understanding current developments in machine learning or in grasping the foundations of certain subfields of AI (e.g., classical image processing). Lastly, on top of mentioning very important papers, I will also try to add some papers that are less cited and less well known, but that I think are well written and were also part of my personal learning journey in AI.

Deep generative modeling

Generative modeling is an unsupervised form of machine learning in which the model learns to discover the patterns in input data. Among deep generative models, two major families stand out and deserve special attention: Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). VAEs are autoencoders that tackle the problem of the irregular latent space of classical autoencoder neural networks by making the encoder return a distribution over the latent space instead of a single point. The encodings' distribution is regularised during training to ensure that the latent space has good properties that allow us to generate new data. The loss function of VAEs, composed of a reconstruction term and a regularisation term (the Kullback-Leibler divergence between the approximate posterior over the latent variables given the input data and the latent prior), is derived using variational inference. The following articles by Joseph Rocca on Towards Data Science were helpful for understanding the paper: "Understanding Variational Autoencoders (VAEs)" and "Bayesian inference problem, MCMC and variational inference". I also found the following Stack Exchange post about the calculation of the Kullback-Leibler divergence between two multivariate Gaussians useful.
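
To make that loss concrete, here is a minimal sketch in PyTorch (an illustration rather than the paper's code), assuming a Gaussian approximate posterior parameterised by mu and log_var, a standard normal prior, and a binary cross-entropy reconstruction term for inputs scaled to [0, 1]:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_reconstructed, mu, log_var):
    # Reconstruction term: how well the decoder output matches the input
    # (binary cross-entropy is a common choice for inputs scaled to [0, 1]).
    reconstruction = F.binary_cross_entropy(x_reconstructed, x, reduction="sum")
    # Regularisation term: closed-form KL divergence between the approximate
    # posterior N(mu, diag(exp(log_var))) and the standard Gaussian prior N(0, I).
    kl_divergence = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    # The sum is the negative evidence lower bound (ELBO) minimised during training.
    return reconstruction + kl_divergence
```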

Computer Vision

Classical computer vision

This paper presents the scale-invariant feature transform (SIFT), a method to extract feature points and corresponding descriptors (or feature vectors) from images that are invariant to scale and rotation, and robust to affine distortion, 3D viewpoint change, noise, and illumination changes. First, a scale-space pyramidal representation of the image is constructed, and extrema corresponding to blobs are located in that space by fitting a 3D quadratic function to the difference-of-Gaussians response. Keypoints with low contrast or lying along strong edges are eliminated. Each keypoint (extremum) is characterized by its 2D position, scale, and orientation, derived from the local orientation histogram. The local image descriptor is built from the binned local orientation histograms, computed relative to the keypoint orientation, over a grid of subregions around the keypoint. I found that the videos of Pratik Jain about homography, image registration, the Harris corner detector and its properties, and the SIFT invariant features and feature descriptors were very helpful for understanding the context of the paper and its concepts.
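
To experiment with these keypoints and descriptors in practice, a small sketch using OpenCV could look like this (assuming opencv-python 4.4 or later, where SIFT is exposed as cv2.SIFT_create; the image path is a placeholder):

```python
import cv2

# Load an image in grayscale (placeholder path).
image = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# Create a SIFT detector and extract keypoints and 128-dimensional descriptors.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)

# Each keypoint carries the position, scale, and orientation described above.
for kp in keypoints[:5]:
    print(kp.pt, kp.size, kp.angle)
```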

Computer vision with deep learning

This paper uses fully convolutional networks (FCN), that is, networks that only use convolutions and no fully connected layers, to perform segmentation of natural images. In contrast to classical ConvNets for image classification, FCNs do not require a fixed input image size; they also have low time complexity. 1x1 convolutions replace the fully connected layers to produce a coarse heat map in which each channel represents a class, and "deconvolution" (upsampling) layers map the features at that coarse level to the larger 2D output segmentation. Combining the information from features at different depths in the network using skip connections helps refine the dense output by adding localization information to the more semantic, content-related features. I found the implementation in the d2l.ai book helpful for understanding the paper, although the skip layers were not implemented there.
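
As an illustration of the 1x1 classification head and the upsampling step, here is a simplified sketch with a ResNet-18 backbone and a single x32 transposed convolution (my own simplification assuming torchvision >= 0.13, not the exact architecture of the paper, which also adds skip connections):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

num_classes = 21  # e.g., Pascal VOC; placeholder value

# Backbone without the final pooling and fully connected layers.
backbone = nn.Sequential(*list(resnet18(weights=None).children())[:-2])

head = nn.Sequential(
    # 1x1 convolution maps the backbone features to one channel per class,
    # producing a coarse per-class heat map.
    nn.Conv2d(512, num_classes, kernel_size=1),
    # Transposed convolution ("deconvolution") upsamples the coarse map
    # back to the input resolution (x32 for this backbone).
    nn.ConvTranspose2d(num_classes, num_classes, kernel_size=64,
                       stride=32, padding=16),
)

x = torch.randn(1, 3, 224, 224)
logits = head(backbone(x))
print(logits.shape)  # torch.Size([1, 21, 224, 224])
```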

U-Net is a neural network for image segmentation that uses relatively symmetric contracting and expanding paths (which hence form a U shape) with skip connections. That architecture strikes a good balance between localization accuracy and the use of context while keeping the computational cost low. It is based on the FCN architecture, but differs in that U-Net keeps a large number of feature channels in the up-sampling path, which lets the network propagate context information to the higher-resolution layers.
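
Here is a minimal, single-stage sketch of the U shape and its skip connection in PyTorch (my own simplification: it uses padded convolutions so the output resolution matches the input, unlike the unpadded convolutions of the original paper):

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, as used in each U-Net stage.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.enc = double_conv(3, 64)        # contracting path (one stage)
        self.down = nn.MaxPool2d(2)
        self.bottom = double_conv(64, 128)
        self.up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec = double_conv(128, 64)      # expanding path (one stage)
        self.out = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        e = self.enc(x)
        b = self.bottom(self.down(e))
        u = self.up(b)
        # Skip connection: concatenate encoder features with upsampled features.
        return self.out(self.dec(torch.cat([e, u], dim=1)))

print(TinyUNet()(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 2, 64, 64])
```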

This paper introduces feature pyramid networks (FPN), a framework that exploits the inherent multi-scale pyramidal hierarchy of ConvNets, with low-resolution but semantically strong features and high-resolution but semantically weak features, to construct a feature pyramid that has strong semantics at all scales. The bottom-up pathway of the FPN is the feed-forward computation of the backbone ConvNet, which computes feature maps at several scales. The subsequent top-down pathway upsamples the feature map from the highest level, and lateral connections enhance it at each level by element-wise addition. Prediction is performed at each scale of the top-down path. The authors adapt the Region Proposal Network (RPN) and Fast R-CNN to the FPN framework, respectively for bounding box proposal generation and object detection, and also show strong performance on segmentation.
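
Here is a sketch of the top-down pathway and lateral connections (an illustration with channel counts I assume to match a ResNet-style backbone, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Top-down pathway with lateral connections over bottom-up feature maps."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions align the channel dimension of each bottom-up map.
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convolutions smooth the merged maps before prediction.
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, features):
        # `features` are bottom-up maps ordered from high to low resolution (C2..C5).
        laterals = [lat(f) for lat, f in zip(self.lateral, features)]
        # Start from the coarsest level and merge downward by upsampling + addition.
        merged = [laterals[-1]]
        for lat in reversed(laterals[:-1]):
            top = F.interpolate(merged[0], size=lat.shape[-2:], mode="nearest")
            merged.insert(0, lat + top)
        return [s(m) for s, m in zip(self.smooth, merged)]  # pyramid levels P2..P5

# Example with dummy feature maps at strides 4, 8, 16, 32 for a 256x256 input.
feats = [torch.randn(1, c, 256 // s, 256 // s)
         for c, s in zip((256, 512, 1024, 2048), (4, 8, 16, 32))]
print([p.shape for p in TopDownFPN()(feats)])
```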

Unsupervised representation learning is successful in natural language processing, but supervised pre-training still prevails in computer vision. Furthermore, constructing large-scale labeled datasets is a difficult task, so self-supervised learning could be useful in computer vision. In the SimCLR method, data is augmented using random cropping, resizing, and color distortion to form positive and negative pairs of images. Two functions f and g are learned, where f corresponds to the learned representation and g is the "projection head", such that the contrastive loss function is minimized for the composition (g o f). That loss function is the normalized temperature-scaled cross-entropy loss (NT-Xent). One of the key findings is that unsupervised learning seems to benefit more from scaling up (model size, batch size, training epochs, data augmentation) than supervised learning.
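
Here is a sketch of that contrastive loss for a batch of positive pairs (my own illustration of the NT-Xent idea in PyTorch, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    # z1, z2: projections g(f(x)) of the two augmented views, shape (N, d).
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2N x d, unit norm
    sim = z @ z.t() / temperature                         # temperature-scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                     # exclude self-similarity
    # For sample i, the positive is the other augmented view of the same image;
    # all remaining samples in the batch act as negatives.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Example: projections for a batch of 8 image pairs, 128-dimensional.
loss = nt_xent_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```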

Published on January 5, 2023, last update on February 5, 2023